-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingestion): Adding ability to ignore users from top users calculation #3735
feat(ingestion): Adding ability to ignore users from top users calculation #3735
Conversation
Here is an example how it looks like if you add the following to the usage config and only that user queried the table:
Every other stat contains those queries except the top users. |
580b599
to
7045348
Compare
@@ -112,6 +116,7 @@ def make_usage_workunit( | |||
|
|||
class BaseUsageConfig(BaseTimeWindowConfig): | |||
top_n_queries: pydantic.PositiveInt = 10 | |||
ignore_user_emails_from_usage: Optional[List[str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this an AllowDenyPattern type to be consistent with rest of the pattern based filters we have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I changed it.
@@ -27,6 +27,7 @@ | |||
class GenericAggregatedDataset(Generic[ResourceType]): | |||
bucket_start_time: datetime | |||
resource: ResourceType | |||
ignore_user_emails: Optional[List[str]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change it to an AllowDenyPattern type.
@@ -0,0 +1,174 @@ | |||
from datetime import datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this added by mistake? Can we merge this with test_usage_common.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it was a mistake. I removed it.
self.userFreq[user_email] += 1 | ||
if not self.ignore_user_emails or user_email not in self.ignore_user_emails: | ||
self.userFreq[user_email] += 1 | ||
|
||
if query: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we not dropping the query made by an ignored user? Should we not just drop the entry altogether, if we don't want the data from the user pollute the usage stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add in here, when we first discussed this feature we agreed to not drop them from totals as it was a heavier touch operation. I think we have enough signal to know that it would make sense to drop these users from sum total as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the logic to skip these users from every count.
@@ -27,6 +28,7 @@ | |||
class GenericAggregatedDataset(Generic[ResourceType]): | |||
bucket_start_time: datetime | |||
resource: ResourceType | |||
ignore_user_emails: AllowDenyPattern = AllowDenyPattern.allow_all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ignore_user_emails/user_email_pattern/. The deny list will imply ignore.
- Corresponding documentation updates.
test_email2 = "[email protected]" | ||
test_query = "select * from test" | ||
test_query2 = "select * from test2" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove unnecessary new lines here.
def test_add_one_query_with_ignored_user(): | ||
test_email = "[email protected]" | ||
test_query = "select * from test" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove unnecessary new lines here.
| `start_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Earliest date of usage logs to consider. | | ||
| `end_time` | | Last full day in UTC (or hour, depending on `bucket_duration`) | Latest date of usage logs to consider. | | ||
| `top_n_queries` | | `10` | Number of top queries to save to each table. | | ||
| `ignore_user_emails_from_usage.allow` | | | List of regex patterns for user emails to include in usage. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/ignore_user_emails_from_usage/user_email_pattern.
AggregatedDataset( | ||
bucket_start_time=floored_ts, | ||
resource=resource, | ||
ignore_user_emails=self.config.ignore_user_emails_from_usage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't these gotten renamed?
AggregatedDataset( | ||
bucket_start_time=floored_ts, | ||
resource=resource, | ||
ignore_user_emails=self.config.ignore_user_emails_from_usage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't these gotten renamed?
AggregatedDataset( | ||
bucket_start_time=floored_ts, | ||
resource=resource, | ||
ignore_user_emails=self.config.ignore_user_emails_from_usage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't these gotten renamed?
bucket_start_time=floored_ts, resource=resource | ||
bucket_start_time=floored_ts, | ||
resource=resource, | ||
ignore_user_emails=self.config.ignore_user_emails_from_usage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't these gotten renamed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
9e4b166
to
5748cb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Checklist